Analyzing the open access status of scholarly articles is a challenging task that many libraries, universities and researchers are facing. In this blog post, we discuss a Python based approach for analyzing journal articles using the Unpaywall database. In particular, we build upon the results of an existing blog article which used R and provide a new python implementation.
The Unpaywall database is a collection of millions of scholarly fulltext articles that were harvested from a variety of different repositories and services like PubMed Central and DOAJ. Although Unpaywall offers a REST API to query the Unpaywall database, it is more convenient and efficient to use the provided database snapshots, which are usually released twice a year.
In this blog post, we discuss the results from the blog article “Open Access Evidence in Unpaywall” by Najko Jahn and Anne Hobert by comparing different database snapshots obtained from Unpaywall. Since the article was published on May 7th, 2019, Unpaywall has recently released a new data dump with up-to-date information about scholarly journal articles. Therefore it might be interesting to see how the change in data impacts the results of our analysis compared to the previous blog article. For this post we are using the Unpaywall data dump from November 2019.
In order to query the Unpaywall dataset, which is stored in Google BigQuery, we use the google-cloud-bigquery package. By default, this package is not shipped with pandas. However, we recommend to use this package along with pandas to have a better experience. The major reason for this being, that the queried data can easily be exported into a pandas DataFrame.
from google.cloud import bigquery
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib.patches as mpatches
import matplotlib.ticker as mtick
import seaborn as sns
import upsetplot
The setup for the Google BigQuery Python client is very simple. Firstly, we import bigquery from the google.cloud package and create a client object by passing the project name as a parameter. If we didn´t set our credentials for the Google Cloud service yet, Google asks us to verify our client. This can be done by simply exporting an API-Key provided by Google into the working environment or by installing the official Google Cloud SDK. A detailed description on how to authenticate the client can be found here. Notice that our project-database has restricted access.
client = bigquery.Client(project='api-project-764811344545')
According to the aforementioned blog article, the dataset being used has two tables, containing records between 2008 and 2018. For reusability, we define two variables which are holding the table´s names to call them in SQL queries. We are also using the improved string formatting syntax which is a new feature since Python 3.6.
# database snapshots
upw_08_12 = '`oadoi_full.mongo_export_upwNov19_08_12`'
upw_13_18 = '`oadoi_full.mongo_export_upwNov19_13_19`'
We can query the project-database by using the query method on our created client object. In this example we were requesting ten journal articles that were published in 2018. We can pass the SQL query simply as a string into our query method. Next we can chain the to_dataframe method on our query to get a pandas DataFrame.
client.query(f"""
SELECT *
FROM {upw_13_18}
WHERE year=2018
AND genre="journal-article"
LIMIT 10
""").to_dataframe()
#> oa_status ... data_standard
#> 0 gold ... 2
#> 1 bronze ... 2
#> 2 gold ... 2
#> 3 gold ... 2
#> 4 gold ... 2
#> 5 green ... 2
#> 6 green ... 2
#> 7 gold ... 2
#> 8 green ... 2
#> 9 gold ... 2
#>
#> [10 rows x 14 columns]
To contrast the previous results from the blog article with the recent results, we begin with a comparison of the total number of articles between the two datasets from February 2019 and November 2019. More importantly, we will investigate the open access share between these two.
Before we can compute the open access proportion in our dataset, we must query our database and count the number of distinct DOI´s by year and open access status. Since we have two tables, we need to concatenate our dataframes in the next step. We also have to convert the column which contains the year from string to datetime. This allows us to handle date informations much better when visualizing data. After we have calculated the proportion of open access by year, we will now sort the values by time.
oa_08_12 = client.query(f"""
SELECT year, is_oa, COUNT(DISTINCT(doi)) AS n
FROM {upw_08_12}
WHERE genre="journal-article"
GROUP BY year, is_oa
""").to_dataframe()
oa_13_18 = client.query(f"""
SELECT year, is_oa, COUNT(DISTINCT(doi)) AS n
FROM {upw_13_18}
WHERE year<2019 AND genre="journal-article"
GROUP BY year, is_oa
""").to_dataframe()
df = pd.concat([oa_08_12, oa_13_18])
df.year = pd.to_datetime(df.year.apply(lambda x: str(x) + "-01-01"))
df['prop'] = df.groupby(['year'])['n'].transform(lambda x: x / x.sum())
df = df.sort_values(by=['year']).reset_index(drop=True)
df.head()
#> year is_oa n prop
#> 0 2008-01-01 True 646079 0.312379
#> 1 2008-01-01 False 1422175 0.687621
#> 2 2009-01-01 True 730604 0.323450
#> 3 2009-01-01 False 1528180 0.676550
#> 4 2010-01-01 True 823574 0.328862
For visualization, we are using matplotlib. Matplotlib is a commonly used plotting library for Python that is influenced by Matlab and enables a wide range of different plot figures. Even though pandas allows us to directly call a plot method on a DataFrame, we feel more comfortable with using essential matplotlib methods. Especially, when customizing the figure.
In our first plot we visualize the open access share to journal articles over time. Whereas our plot is not interactive like in the stated blog article, we could make use of the same plotting library named Plotly which can be found here. Plotly is available for R, Javascript and Python.
x = df.year.unique()
y1 = df[df.is_oa == True].n
y2 = df[df.is_oa == False].n
plt.style.use('seaborn-whitegrid')
plt.rcParams['figure.figsize'] = [10, 4]
plt.grid(False, which='both', axis='x')
plt.box(False)
plt.stackplot(x, y1, y2, colors=['#56B4E9', '#b3b3b3a0'], alpha=0.8)
plt.title('Open Access to Journal Articles', fontdict={'fontsize': 15, 'fontweight': 600}, \
pad=20)
plt.xlabel('Year published', labelpad=10, fontdict={'fontsize': 12, 'fontweight': 500})
plt.ylabel('Journal Articles', labelpad=10, fontdict={'fontsize': 12, 'fontweight': 500})
is_oa_patch = mpatches.Patch(color='#56B4E9', label='TRUE')
is_not_oa_patch = mpatches.Patch(color='#b3b3b3a0', label='FALSE')
plt.legend(handles=[is_oa_patch, is_not_oa_patch], title='Is OA?', \
fontsize='medium', bbox_to_anchor=(1.2, 1.05), labelspacing=1.2)
plt.tight_layout()
plt.show()
Figure 1: Open access to journal articles according to Unpaywall.
As expected, the total number of journal articles has increased compared to the previous results from the blog article. In fact, the number of articles with a distinct DOI included in the Unpaywall data dump from November 2019 has increased by 2% in comparison to the February 2019 data dump used in the original blog post. In addition, the share of open access articles has increased by 6%. Overall, 13,569,137 freely available articles were published between 2008 and 2018.
Next, we investigate the differences between the distribution of the host types specified in the data dumps. The host type describes the type of location that serves open access full-texts and accepts two values: publisher and repository. As mentioned in the blog article, the host type variable is determined by Unpaywall’s algorithm.
HOST_TYPE_08_12_QUERY = f"""
SELECT year, host_type, journal_is_in_doaj,
COUNT(DISTINCT(doi)) AS number_of_articles
FROM {upw_08_12}, UNNEST (oa_locations)
WHERE genre="journal-article" AND is_best=true
GROUP BY year, host_type, journal_is_in_doaj
"""
HOST_TYPE_13_18_QUERY = f"""
SELECT year, host_type, journal_is_in_doaj,
COUNT(DISTINCT(doi)) AS number_of_articles
FROM {upw_13_18}, UNNEST (oa_locations)
WHERE genre="journal-article" AND year<2019 AND is_best=true
GROUP BY year, host_type, journal_is_in_doaj
"""
Like in the aforementioned blog article, we create a host column with the pandas provided loc method to highlight freely available full-texts provided by DOAJ in addition to the regular host types. Because DOAJ ensures high quality standards and peer reviews for their indexed journals, it might be interesting to see whether an increase of publishing in potentially less strict open access journals can be observed or not.
host_type_08_12_query_df = client.query(HOST_TYPE_08_12_QUERY).to_dataframe()
host_type_13_18_query_df = client.query(HOST_TYPE_13_18_QUERY).to_dataframe()
host_type_df = pd.concat([host_type_08_12_query_df, host_type_13_18_query_df])
host_type_df.year = pd.to_datetime(host_type_df.year.apply(lambda x: str(x) + "-01-01"))
host_type_df = host_type_df.sort_values(by=['year']).reset_index(drop=True)
host_type_df.loc[host_type_df['host_type'] == 'publisher', 'host'] = 'Other Journals'
host_type_df.loc[host_type_df['host_type'] == 'repository', 'host'] = 'Repositories only'
host_type_df.loc[host_type_df['journal_is_in_doaj'] == True, 'host'] = 'DOAJ-listed Journal'
host_type_df.head()
#> year host_type ... number_of_articles host
#> 0 2008-01-01 publisher ... 295155 Other Journals
#> 1 2008-01-01 publisher ... 92337 DOAJ-listed Journal
#> 2 2008-01-01 repository ... 258587 Repositories only
#> 3 2009-01-01 publisher ... 114857 DOAJ-listed Journal
#> 4 2009-01-01 repository ... 283108 Repositories only
#>
#> [5 rows x 5 columns]
Again, we visualize our data with matplotlib. Unlike ggplots for R, we have a more inconvenient way to prepare and plot our data in Python. This starts by iterating over the host types to generate a subplot for each. Although, we can make use of pandas plotting methods to display complicated graphics from DataFrames, it requires much effort to get publication quality figures.
all_articles = host_type_df.groupby(['year'])['number_of_articles'].sum() \
.reset_index(name='number_of_articles')
x = all_articles['year'].dt.year
y_total = all_articles.number_of_articles
plt.style.use('seaborn-whitegrid')
plt.rcParams['figure.figsize'] = [11, 3.5]
fig,(axes) = plt.subplots(nrows=1, ncols=3)
fig.suptitle('Open Access to Journal Articles by Unpaywall host', fontsize=16, y=1.10)
for i, host in enumerate(host_type_df.host.unique(), 1):
ax = plt.subplot(1,3,i)
y_stacked = host_type_df[host_type_df.host==host] \
.groupby(['year'])['number_of_articles'].sum() \
.reset_index(name='number_of_articles').number_of_articles
ax.bar(x, y_total, color='#b3b3b3a0')
ax.bar(x, y_stacked, color='#56B4E9')
ax.set_title(host, fontdict={'fontsize': 12, 'fontweight': 500})
ax.set_frame_on(False)
ax.grid(False, which='both', axis='x')
if i > 1:
ax.set_yticklabels([], visible=False)
# common xlabel
fig.text(0.45, -0.03, 'Year', ha='center',
fontdict={'fontsize': 13, 'fontweight': 500})
# common ylabel
fig.text(-0.02, 0.5, 'OA Articles (Total)', va='center', rotation='vertical',
fontdict={'fontsize': 13, 'fontweight': 500})
by_host_patch = mpatches.Patch(color='#56B4E9', label='by Host')
all_oa_patch = mpatches.Patch(color='#b3b3b3a0', label='All OA Articles')
plt.legend(handles=[all_oa_patch, by_host_patch], fontsize='medium', \
bbox_to_anchor=(1.05, 0.9), labelspacing=1.2)
fig.tight_layout()
plt.show()
Figure 2: Open access to journal articles by open access hosting location.
As can be seen from the figure, the proportion of open access articles that are accessible through journals which are not listed in DOAJ has rapidly increased in the year 2018 in comparison to the previous results. Nonetheless, the overall share of articles obtained from Journals that are not listed in DOAJ has decreased from 56% to 47%. Consequently, the share of open access articles provided by DOAJ-listed journals has increased.
In the following, we explore varieties between the evidence types of the two data dumps. The evidence type is a variable determined by Unpaywall, which expresses the location at which an article was found and how Unpaywall was able to identify the open access status of an article.
EVIDENCE_08_12_QUERY = f"""
SELECT evidence, year, is_best,
COUNT(distinct(doi)) AS number_of_articles
FROM {upw_08_12}, UNNEST (oa_locations)
WHERE genre="journal-article"
GROUP BY evidence, year, is_best
"""
EVIDENCE_13_18_QUERY = f"""
SELECT evidence, year, is_best,
COUNT(distinct(doi)) AS number_of_articles
FROM {upw_13_18}, UNNEST (oa_locations)
WHERE genre="journal-article" AND year < 2019
GROUP BY evidence, year, is_best
"""
evidence_08_12 = client.query(EVIDENCE_08_12_QUERY).to_dataframe()
evidence_13_18 = client.query(EVIDENCE_13_18_QUERY).to_dataframe()
evidence_df = pd.concat([evidence_08_12, evidence_13_18])
evidence_df.year = pd.to_datetime(evidence_df.year.apply(lambda x: str(x) + "-01-01"))
evidence_df.head()
#> evidence ... number_of_articles
#> 0 oa repository (semantic scholar lookup) ... 285658
#> 1 open (via page says license) ... 177227
#> 2 oa journal (via doaj) ... 138492
#> 3 oa repository (via pmcid lookup) ... 219378
#> 4 oa journal (via observed oa rate) ... 9818
#>
#> [5 rows x 4 columns]
For each evidence type we calculate the total number of articles and the related proportion, as well as the cumulative proportion regarding to the number of all articles. Then we display the results in form of a table. To specify the precision of the resulting floating point numbers we again make use of the Python built-in string formatting function. Here we specify by 2 digits of accuracy.
articles_per_type_df = evidence_df.groupby(['evidence']).number_of_articles \
.sum() \
.to_frame() \
.reset_index() \
.sort_values(by=['number_of_articles'], ascending=False) \
.reset_index(drop=True)
articles_per_type_df['prop'] = articles_per_type_df['number_of_articles'] \
.transform(lambda x:
x / articles_per_type_df['number_of_articles'] \
.sum() * 100)
articles_per_type_df['cumul'] = articles_per_type_df.prop.cumsum()
articles_per_type_table = articles_per_type_df.copy()
articles_per_type_table.prop = articles_per_type_table.prop \
.apply(lambda x: '{0:.2f}'.format(x))
articles_per_type_table.cumul = articles_per_type_table.cumul \
.apply(lambda x: '{0:.2f}'.format(x))
articles_per_type_table.columns = ['Evidence Types',
'Number of Articles',
'Proportion of all Articles in %',
'Cumulative Proportion in %']
articles_per_type_table
| Evidence Types | Number of Articles | Proportion of all Articles in % | Cumulative Proportion in % | |
|---|---|---|---|---|
| 0 | oa repository (semantic scholar lookup) | 5759641 | 20.12 | 20.12 |
| 1 | oa repository (via OAI-PMH doi match) | 4467382 | 15.6 | 35.72 |
| 2 | open (via free pdf) | 4427452 | 15.47 | 51.19 |
| 3 | open (via page says license) | 3653314 | 12.76 | 63.95 |
| 4 | oa journal (via doaj) | 3577409 | 12.5 | 76.45 |
| 5 | oa repository (via pmcid lookup) | 3237547 | 11.31 | 87.75 |
| 6 | oa repository (via OAI-PMH title and first author match) | 1699654 | 5.94 | 93.69 |
| 7 | oa journal (via observed oa rate) | 781574 | 2.73 | 96.42 |
| 8 | open (via crossref license) | 594003 | 2.07 | 98.5 |
| 9 | open (via page says Open Access) | 101344 | 0.35 | 98.85 |
| 10 | open (via free article) | 100145 | 0.35 | 99.2 |
| 11 | oa journal (via publisher name) | 74618 | 0.26 | 99.46 |
| 12 | open (via crossref license, author manuscript) | 64929 | 0.23 | 99.69 |
| 13 | oa repository (via OAI-PMH title match) | 51979 | 0.18 | 99.87 |
| 14 | oa repository (via OAI-PMH title and last author match) | 37218 | 0.13 | 100 |
| 15 | oa journal (via issn in doaj) | 392 | 0 | 100 |
| 16 | manual | 27 | 0 | 100 |
| 17 | hybrid (via page says license) | 1 | 0 | 100 |
Interestingly, the evidence type with the highest number of related articles was not included in the previous database snapshot.
sum_other = articles_per_type_df \
.loc[articles_per_type_df['prop'] < 1] \
.number_of_articles \
.sum()
articles_per_type_other_df = pd.DataFrame([['other', sum_other]],
columns=['evidence', 'number_of_articles'])
articles_per_type_keep_df = articles_per_type_df.loc[articles_per_type_df['prop'] > 1]
articles_per_type_grouped_df = pd.concat([articles_per_type_other_df,
articles_per_type_keep_df])
articles_per_type_grouped_df = articles_per_type_grouped_df \
.groupby(['evidence']) \
.number_of_articles.sum() \
.to_frame().reset_index() \
.sort_values(by=['number_of_articles'], ascending=False) \
.reset_index(drop=True)
articles_per_type_grouped_df['prop'] = articles_per_type_grouped_df['number_of_articles'] \
.transform(lambda x:
x / articles_per_type_grouped_df['number_of_articles'].sum() * 100)
articles_per_type_grouped_df['cumul'] = articles_per_type_grouped_df.prop.cumsum()
evidence_grouped_df = evidence_df.copy()
list_of_small_evidence_types = articles_per_type_df \
.loc[articles_per_type_df['prop'] < 1] \
.evidence.tolist()
evidence_grouped_df.evidence = evidence_grouped_df \
.evidence.replace(list_of_small_evidence_types, 'other')
evidence_grouped_df = evidence_grouped_df.groupby(['evidence', 'is_best', 'year']) \
.number_of_articles.sum() \
.to_frame().reset_index() \
.sort_values(by=['number_of_articles'], ascending=False) \
.reset_index(drop=True)
evidence_grouped_plot_df = evidence_grouped_df \
.groupby(['evidence', 'is_best']) \
.number_of_articles.sum() \
.to_frame().reset_index() \
.sort_values(by=['number_of_articles'], ascending=True) \
.reset_index(drop=True)
articles_per_type_grouped_plot_df = articles_per_type_grouped_df \
.sort_values(by=['number_of_articles'], ascending=True) \
.reset_index(drop=True)
y1 = articles_per_type_grouped_plot_df.evidence
x_total = articles_per_type_grouped_plot_df.number_of_articles
x_stacked = evidence_grouped_plot_df[evidence_grouped_plot_df.is_best == True] \
.set_index('evidence') \
.reindex(index=articles_per_type_grouped_plot_df['evidence']) \
.reset_index().number_of_articles
plt.style.use('seaborn-whitegrid')
plt.rcParams['figure.figsize'] = [7, 5]
plt.grid(False, which='both', axis='y')
plt.box(False)
ax1 = plt.barh(y1, x_total, color='#b3b3b3a0')
ax2 = plt.barh(y1, x_stacked, color='#56B4E9')
plt.title('Number of Open Access Articles per Unpaywall Evidence Type',
fontdict={'fontsize': 15, 'fontweight': 600}, pad=20)
plt.xlabel('Number of Open Access Articles', labelpad=10,
fontdict={'fontsize': 13, 'fontweight': 500})
plt.ylabel('Evidence Type', labelpad=10, fontdict={'fontsize': 13, 'fontweight': 500})
is_best_patch = mpatches.Patch(color='#56B4E9', label='TRUE')
is_not_best_patch = mpatches.Patch(color='#b3b3b3a0', label='FALSE')
plt.legend(handles=[is_best_patch, is_not_best_patch], title='Is best?',
fontsize='medium', bbox_to_anchor=(1.2, 1.05), labelspacing=1.2)
plt.show()
Figure 3: Number of articles per evidence type.
The figure provides two types of evidence that were not specifically mentioned in the previous work: ‘oa repository (semantic scholar lookup)’ and ‘oa journal (via observed rate)’. Apparently, these evidence types are responsible for the overall increase in articles between the two data dumps from February and November 2019.
y1 = evidence_grouped_df.sort_values(by=['year'], ascending=True) \
.reset_index(drop=False) \
['year'].dt.year.unique()
plt.style.use('seaborn-whitegrid')
fig, (axes) = plt.subplots(nrows=5,
ncols=2,
sharex=True,
sharey=True,
figsize=(12,12))
# get a one-dimensional array
axes = axes.reshape(-1)
fig.suptitle('Unpaywall Open Access Evidence Categories per Year', fontsize=16, y=1.05)
for i, ax in enumerate(axes, 1):
ax = plt.subplot(5,2,i, sharey=axes[0], sharex=axes[0])
if i % 2 == 0:
plt.setp(ax.get_yticklabels(), visible=False)
evidence = evidence_grouped_df.evidence.unique()[i-1]
x_total = evidence_grouped_df[evidence_grouped_df.evidence == evidence] \
.groupby(['year']) \
.number_of_articles.sum() \
.to_frame().reset_index() \
.number_of_articles
x_stacked = evidence_grouped_df[evidence_grouped_df.evidence == evidence] \
.loc[evidence_grouped_df.is_best == True] \
.groupby(['year']) \
.number_of_articles.sum() \
.to_frame().reset_index() \
.number_of_articles
ax = plt.bar(y1, x_total, color='#b3b3b3a0')
ax = plt.bar(y1, x_stacked, color='#56B4E9')
plt.title(evidence, fontdict={'fontsize': 12, 'fontweight': 500}, pad=0.2)
plt.grid(False, which='both', axis='x')
plt.box(False)
# common xlabel
fig.text(0.52, -0.03, 'Publication Year', ha='center',
fontdict={'fontsize': 13, 'fontweight': 500})
# common ylabel
fig.text(-0.04, 0.5, 'Number of Open Access Articles', va='center', rotation='vertical',
fontdict={'fontsize': 13, 'fontweight': 500})
is_best_patch = mpatches.Patch(color='#56B4E9', label='TRUE')
is_not_best_patch = mpatches.Patch(color='#b3b3b3a0', label='FALSE')
fig.legend(handles=[is_best_patch, is_not_best_patch], title='Is best?',
fontsize='large', title_fontsize='x-large',
bbox_to_anchor=(1.15, 0.95), labelspacing=1.2)
plt.tight_layout()
plt.show()
Figure 4: Development of the number of articles per evidence type over time.
Owing to possible multiple associations between an article and evidence types in Unpaywall, we investigate the intersection between host types in the next step. Again, we compare the results with the previous results.
HOST_TYPE_INTERSECT_08_12_QUERY = f"""
SELECT year, host_type_count,
count(distinct(doi)) as number_of_articles
FROM
(SELECT doi, year,
STRING_AGG(DISTINCT(host_type)
ORDER BY host_type)
as host_type_count
FROM {upw_08_12}, UNNEST (oa_locations)
WHERE genre="journal-article"
GROUP BY doi, year)
GROUP BY host_type_count, year
ORDER BY number_of_articles desc
"""
HOST_TYPE_INTERSECT_13_18_QUERY = f"""
SELECT year, host_type_count,
count(distinct(doi)) as number_of_articles
FROM
(SELECT doi, year,
STRING_AGG(DISTINCT(host_type)
ORDER BY host_type) as host_type_count
FROM {upw_13_18}, UNNEST (oa_locations)
WHERE genre="journal-article" AND year < 2019
GROUP BY doi, year)
GROUP BY host_type_count, year
ORDER BY number_of_articles desc
"""
host_type_08_12_intersect_df = client.query(HOST_TYPE_INTERSECT_08_12_QUERY).to_dataframe()
host_type_13_18_intersect_df = client.query(HOST_TYPE_INTERSECT_13_18_QUERY).to_dataframe()
host_type_intersect_df = pd.concat([host_type_08_12_intersect_df,
host_type_13_18_intersect_df])
host_type_intersect_df.year = pd.to_datetime(host_type_intersect_df.year \
.apply(lambda x: str(x) + "-01-01"))
host_type_intersect_df \
.loc[host_type_intersect_df['host_type_count'] == 'publisher', 'host'] = 'Publisher only'
host_type_intersect_df \
.loc[host_type_intersect_df['host_type_count'] == 'publisher,repository',
'host'] = 'Publisher & Repository'
host_type_intersect_df \
.loc[host_type_intersect_df['host_type_count'] == 'repository',
'host'] = 'Repositories only'
articles_total_by_year_df = df.groupby(['year']).n.sum().to_frame().reset_index()
articles_total_by_year_df.columns = ['year', 'all_articles']
host_type_intersect_df = pd.merge(articles_total_by_year_df,
host_type_intersect_df, on='year', how='right')
host_type_intersect_df = host_type_intersect_df.groupby(['year', 'host']) \
.sum().eval('prop = number_of_articles/all_articles') \
.reset_index()
host_type_intersect_df.head()
#> year host ... number_of_articles prop
#> 0 2008-01-01 Publisher & Repository ... 184598 0.089253
#> 1 2008-01-01 Publisher only ... 202894 0.098099
#> 2 2008-01-01 Repositories only ... 258587 0.125027
#> 3 2009-01-01 Publisher & Repository ... 216379 0.095794
#> 4 2009-01-01 Publisher only ... 231117 0.102319
#>
#> [5 rows x 5 columns]
host_type_all = host_type_intersect_df.copy()
host_type_all['prop'] = host_type_intersect_df.groupby(['year'])['prop'] \
.transform(lambda x: x.sum())
This time we are using the seaborn package for generating bar plots. Seaborn is a visualization library which is build on top of matplotlib. It is well designed to fit with pandas DataFrames and it also enables smoother plots. Also, we can continue to use matplotlib methods.
x = host_type_intersect_df['year'].dt.year
y = host_type_all.prop
plt.style.use('seaborn-whitegrid')
plt.rcParams['figure.figsize'] = [11, 3.5]
fig,(axes) = plt.subplots(nrows=1, ncols=3)
fig.suptitle('Overlap between Open Access Host Types in Unpaywall', fontsize=16, y=1.10)
for i, host in enumerate(host_type_intersect_df.host.unique(), 1):
ax = plt.subplot(1,3,i)
y_stacked = host_type_intersect_df[host_type_intersect_df.host == host].prop
sns.barplot(x, y, color='#b3b3b3a0', alpha=0.8, saturation=1, ci=None)
sns.barplot(x, y_stacked, color='#56B4E9', alpha=1, saturation=1, ci=None)
ax.set_title(host, fontdict={'fontsize': 12, 'fontweight': 500})
ax.set_frame_on(False)
ax.grid(False, which='both', axis='x')
ax.set(xlabel='', ylabel='')
for label in ax.get_xticklabels()[1::2]:
label.set_visible(False)
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1.0))
if i > 1:
ax.set_yticklabels([], visible=False)
# common xlabel
fig.text(0.45, -0.03, 'Year', ha='center',
fontdict={'fontsize': 13, 'fontweight': 500})
# common ylabel
fig.text(-0.02, 0.5, 'OA Share', va='center', rotation='vertical',
fontdict={'fontsize': 13, 'fontweight': 500})
by_host_patch = mpatches.Patch(color='#56B4E9', label='by Host')
all_oa_patch = mpatches.Patch(color='#b3b3b3a0', label='All OA Articles')
plt.legend(handles=[all_oa_patch, by_host_patch],
bbox_to_anchor=(1.05, 0.9), labelspacing=1.2)
plt.tight_layout()
plt.show()
Figure 5: Open access to journal articles by open access hosting location.
The figure shows that, in terms of percentage, fewer articles were found on publisher websites than in the results of the blog article. Indeed, 73% of all open access full-texts are available through publisher websites which is a total of 9,970,245 articles. This is a decrease by almost 9% compared to the previous results. Also, the proportion of articles that are not archived in a repository has declined from 56% to 35%. Consequently, the proportion of articles that are available from both the publisher websites, as well as those that are archived in a repository have increased from 26% to 39%.
EVIDENCE_SINGLE_CAT_08_12_QUERY = f"""
SELECT ev_cat, COUNT(DISTINCT(doi)) AS number_of_articles
FROM
(SELECT doi, STRING_AGG(DISTINCT(evidence), "&"
ORDER BY evidence) AS ev_cat
FROM {upw_08_12}, UNNEST (oa_locations)
WHERE genre="journal-article"
GROUP BY doi
)
GROUP BY ev_cat
"""
EVIDENCE_SINGLE_CAT_13_18_QUERY = f"""
SELECT ev_cat, COUNT(DISTINCT(doi)) AS number_of_articles
FROM
(SELECT doi, STRING_AGG(DISTINCT(evidence), "&"
ORDER BY evidence) AS ev_cat
FROM {upw_13_18}, UNNEST (oa_locations)
WHERE genre="journal-article" AND year < 2019
GROUP BY doi
)
GROUP BY ev_cat
"""
evidence_categories_08_12_df = client.query(EVIDENCE_SINGLE_CAT_08_12_QUERY).to_dataframe()
evidence_categories_13_18_df = client.query(EVIDENCE_SINGLE_CAT_13_18_QUERY).to_dataframe()
evidence_categories_df = pd.concat([evidence_categories_08_12_df,
evidence_categories_13_18_df])
evidence_categories_df = evidence_categories_df.groupby(['ev_cat']) \
.sum().reset_index() \
.sort_values(by=['number_of_articles'], ascending=False) \
.reset_index(drop=True)
evidence_categories_df.head()
#> ev_cat number_of_articles
#> 0 open (via free pdf) 2042872
#> 1 oa repository (semantic scholar lookup) 1233239
#> 2 oa repository (semantic scholar lookup)&open (... 867351
#> 3 oa repository (via OAI-PMH doi match)&oa repos... 535673
#> 4 oa journal (via doaj)&open (via page says lice... 533464
evidence_single_cat_df = evidence_df.groupby(['evidence']) \
.number_of_articles.sum() \
.reset_index()
evidence_single_cat_df = pd.merge(evidence_single_cat_df, evidence_categories_df,
how='left', left_on=['evidence'], right_on=['ev_cat']) \
.drop(['ev_cat'], axis=1)
evidence_single_cat_df.columns = ['evidence', 'number_of_articles', 'number_of_single_cat']
evidence_single_cat_df.evidence = evidence_single_cat_df \
.evidence.replace(list_of_small_evidence_types, 'other')
evidence_single_cat_grouped_df = evidence_single_cat_df.groupby(['evidence']) \
.sum() \
.eval('prop = number_of_single_cat/number_of_articles') \
.reset_index() \
.sort_values(by=['number_of_articles'], ascending=False) \
.reset_index(drop=True)
evidence_single_cat_grouped_df
#> evidence ... prop
#> 0 oa repository (semantic scholar lookup) ... 0.214117
#> 1 oa repository (via OAI-PMH doi match) ... 0.098287
#> 2 open (via free pdf) ... 0.461410
#> 3 open (via page says license) ... 0.126779
#> 4 oa journal (via doaj) ... 0.115270
#> 5 oa repository (via pmcid lookup) ... 0.033182
#> 6 oa repository (via OAI-PMH title and first aut... ... 0.219231
#> 7 oa journal (via observed oa rate) ... 0.299300
#> 8 open (via crossref license) ... 0.460794
#> 9 other ... 0.417280
#>
#> [10 rows x 4 columns]
y1 = evidence_single_cat_grouped_df \
.sort_values(by=['number_of_articles'], ascending=True) \
.reset_index(drop=True).evidence
x_total = evidence_single_cat_grouped_df.groupby(['evidence']) \
.number_of_articles.count() \
.to_frame().reset_index() \
.sort_values(by=['number_of_articles'], ascending=True) \
.reset_index(drop=True) \
.number_of_articles
x_stacked = evidence_single_cat_grouped_df \
.sort_values(by=['number_of_articles'], ascending=True) \
.reset_index(drop=True) \
.prop
fig, ax = plt.subplots(figsize=(7,5))
plt.style.use('seaborn-whitegrid')
#plt.rcParams['figure.figsize'] = [7, 5]
plt.grid(False, which='both', axis='y')
plt.box(False)
ax1 = plt.barh(y1, x_total, color='#b3b3b3a0')
ax2 = plt.barh(y1, x_stacked, color='#56B4E9')
ax.xaxis.set_major_formatter(mtick.PercentFormatter(1.0))
plt.title('Proportion of Articles per Evidence Type',
fontdict={'fontsize': 15, 'fontweight': 600}, pad=20)
plt.xlabel('Proportion of Articles', labelpad=10,
fontdict={'fontsize': 13, 'fontweight': 500})
plt.ylabel('Evidence Type', labelpad=10, fontdict={'fontsize': 13, 'fontweight': 500})
is_unique_patch = mpatches.Patch(color='#56B4E9', label='TRUE')
is_not_unique_patch = mpatches.Patch(color='#b3b3b3a0', label='FALSE')
plt.legend(handles=[is_unique_patch, is_not_unique_patch], title='Is unique?',
fontsize='medium', bbox_to_anchor=(1.2, 1.05), labelspacing=1.2)
plt.show()
Figure 6: Proportion of articles per evidence type.
While openly forms of open access provision identified via pdf or license are still evidence types with a most often unique appearance, it can be seen that the share has decreased, compared to the results from the previous database snapshot. On the other hand, the share of unique occurences of less frequent evidence types which were collated in the category other has increased rapidly.
To visualize the intersection between multiple evidence types, we are using the UpSetPlot package which is better maintained in contrast to py-upset. It offers better support for pandas and is also well documented. Especially the input format for the UpSet plot is well described.
The input format can be generated with the from_memberships method. It accepts two parameters: a nested list with elements corresponding to a set and secondly a list containing additional data which has the same length as the nested list.
evidence_categories_upset_df = evidence_categories_df.groupby(['ev_cat']) \
.sum().reset_index() \
.sort_values(by=['number_of_articles'],
ascending=False) \
.reset_index(drop=True)
# subset of fifteen most frequent distinct evidence types combinations
evidence_categories_upset_most_frequent = evidence_categories_upset_df[:15]
ev_list = evidence_categories_upset_most_frequent.ev_cat.tolist()
ev_list = [ev.split('&') for ev in ev_list]
n_list = evidence_categories_upset_most_frequent.number_of_articles.tolist()
evidence_categories_upset_expr = upsetplot.from_memberships(ev_list, data=n_list)
fig = plt.figure(figsize=(10,6))
axes = upsetplot.UpSet(evidence_categories_upset_expr,
sort_by='cardinality',
sort_categories_by='cardinality',
element_size=20,
intersection_plot_elements=15,
totals_plot_elements=7
).plot(fig=fig)
ax1 = axes['intersections']
ax1.yaxis.grid(False)
ax2 = axes['shading']
ax3 = axes['matrix']
ax4 = axes['totals']
ax4.xaxis.grid(False)
# reduce overlap with text
plt.subplots_adjust(left=0, bottom=0, right=1.1, top=1, wspace=0, hspace=0)
fig.text(0.1, -0.07, 'Set Size', ha='center',
fontdict={'fontsize': 10, 'fontweight': 500})
# delete shading axis for better readability
fig.delaxes(ax2)
plt.show()
Figure 7: Most frequent combinations of evidence types.
In this blog post, we provided an updated analysis of open access evidences in Unpaywall by examining database snapshots obtained from Unpaywall. Using Python, we were able to find 13,569,137 scholarly articles in Unpaywall in the period from 2008 to 2018 that are freely available. Fortunately, we can compare these results with the previous database snapshot from February 2019 which were described in the mentioned blog article. Thusly we can interpret upcoming trends more precisely.
Based on the previous results, our analysis exposes an increase of open access full-texts by 6% from 37% to 43%. Furthermore, we were able to identify new evidence types that were introduced in the latest database snapshot. Altogether, this suggests that more had articles left closed access and Unpaywall was able to link more DOI’s to articles found in the newly introduced evidence types.
Additionally, this work demonstrates similarities between data analysis in Python and R. Due to the fact that this work is based on translated R source code, we´ve learned plenty of approaches to face upcoming data analysis projects.